Student : Esteban Ordenes

Post Graduate Program in Data Science and Business Analytics

PGP-DSBA-UTA-Dec20-A

Background & Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective

Data Dictionary

Load Libraries

Load the Dataset

Check the shape of the dataset

ID is just an index for the data entry. This column will not be a significant factor in determining customers who have a higher probability of purchasing the loan.

ZIPCode could contain many distinct zip codes. We can check how many individual zip codes there are; if there are too many, we can process this column to extract group information.
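One common way to extract group information, sketched here on a hypothetical handful of zip codes standing in for the real column, is to keep only the leading digits, which identify a broader geographic zone:

```python
import pandas as pd

# Hypothetical sample standing in for the ZIPCode column of the dataset
df = pd.DataFrame({"ZIPCode": [91107, 90089, 94720, 94112, 91330]})

# How many distinct ZIP codes are there?
n_unique = df["ZIPCode"].nunique()

# If there are too many, keep only the first two digits, which identify
# a broader geographic zone of the US postal system
df["ZipCodeZone"] = df["ZIPCode"].astype(str).str[:2]
print(n_unique, df["ZipCodeZone"].tolist())
```

The number of digits kept (two here) is a modeling choice: fewer digits give fewer, coarser zones.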

CCAvg, as defined in the data dictionary, represents thousands of dollars. We may need to convert these values.

Fixing the data types

Check for duplicate data and, if any is found, remove it.
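A minimal sketch of the duplicate check, on a hypothetical frame with one repeated row:

```python
import pandas as pd

# Hypothetical frame with one fully duplicated row (indices 0 and 2)
df = pd.DataFrame({"Age": [25, 30, 25], "Income": [49, 81, 49]})

n_dupes = df.duplicated().sum()          # count fully duplicated rows
if n_dupes > 0:
    df = df.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(df))
```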

Check for missing values

Statistical summary for the dataset

Look at the different levels in the categorical variables

Ref: https://en.wikipedia.org/wiki/ZIP_Code

Exploratory Data Analysis

Univariate Analysis

Observations on Age

Observations on Experience

Observations on Income

Observations on CCAvg

Observations on Mortgage

Observations on Personal_Loan (Dependent Variable)

Observations on Family

Observations on ZipCodeZone

Observations on Education

Observations on Securities_Account

Observations on CD_Account

Observations on Online

Observations on CreditCard

Bivariate Analysis

 #   Column              Non-Null Count  Dtype
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   category
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   category
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64

Securities_Account vs Personal_Loan

CD_Account vs Personal_Loan

Customers that have a CD_Account have converted at a 46% rate, versus 7% of customers that do not have a CD_Account.

Online vs Personal_Loan

CreditCard vs Personal_Loan

Family vs Personal_Loan

Experience vs Personal_Loan

Income vs Personal_Loan

CCAvg vs Personal_Loan

Mortgage vs Personal_Loan

Correlation between numeric Variables

We will investigate further those variables that have a high correlation with Personal_Loan.

Evaluate the variables that have a correlation of 0.15 or higher with Personal_Loan:
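The filtering step can be sketched as follows, on synthetic data (one feature built to correlate with the target, one independent of it) rather than the real bank data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: Income is built to correlate with the target, Age is not
target = rng.integers(0, 2, 1000)
df = pd.DataFrame({
    "Income": target * 50 + rng.normal(100, 10, 1000),
    "Age": rng.normal(45, 10, 1000),
    "Personal_Loan": target,
})

# Correlation of every numeric feature with the target, filtered at |r| >= 0.15
corr = df.corr()["Personal_Loan"].drop("Personal_Loan")
high_corr = corr[corr.abs() >= 0.15].index.tolist()
print(high_corr)
```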

PairPlot

Multivariate Analysis

Experience vs Age vs Personal_Loan

Income vs Mortgage vs Personal_Loan

Income vs CCAvg vs Personal_Loan

Income vs CD_Account vs Personal_Loan

CD_Account vs Securities_Account vs Personal_Loan

Family vs CCAvg vs Personal_Loan

Outliers Treatment

Data Pre-Processing

Fixing the data types

Split Data

Model building - Logistic Regression
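The split-and-fit workflow can be sketched like this; `make_classification` stands in for the bank data (with its roughly 10% positive rate), and the 70/30 stratified split is an assumed, typical choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data, with ~10% positive class
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.9, 0.1], random_state=1)

# 70/30 split, stratified to preserve the class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(round(acc, 3))
```

Accuracy alone is reported here only as a sanity check; later sections weigh it against better-suited metrics for the imbalanced target.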

Model Performances

Prediction on training data

Prediction on test data

AUC ROC curve

Optimal threshold
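One common way to pick the threshold, sketched here on hypothetical labels and predicted probabilities, is Youden's J statistic: take the point on the ROC curve that maximizes TPR minus FPR:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.7, 0.8, 0.45, 0.9, 0.15])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# Youden's J statistic: threshold where TPR - FPR is largest
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print(optimal_threshold)
```

Other criteria (e.g. maximizing F1 on a precision-recall curve) can give a different threshold; which one fits depends on the business cost of false positives versus false negatives.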

Observation

Test assumptions

Check for multicollinearity

Model Building - with Predictor data set

Split num_feature_set into training and test set

Building Logistic Regression model from statsmodels

Logit Regression Summary

Calculate the odds ratio from the coef using the formula odds ratio=exp(coef)

Calculate the probability from the odds ratio using the formula probability = odds / (1+odds)
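The two formulas above can be applied directly to the coefficient column of the summary table. The coefficient values below are hypothetical placeholders, not the fitted ones:

```python
import numpy as np

# Hypothetical logit coefficients, as read off a statsmodels summary table
coefs = {"Income": 0.05, "CD_Account": 3.0}

# odds ratio = exp(coef); probability = odds / (1 + odds)
odds_ratios = {k: np.exp(v) for k, v in coefs.items()}
probabilities = {k: o / (1 + o) for k, o in odds_ratios.items()}
print(odds_ratios, probabilities)
```

An odds ratio above 1 (probability above 0.5) means the variable increases the odds of taking the loan, holding the others fixed.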

Most significant variable

Prediction of the model

Prediction on Train data

Prediction on Test data

AUC ROC Curve

Choosing Optimal threshold

Observation

Conclusions

Model building - Decision Tree

  1. Data preparation
  2. Partition the data into train and test sets.
  3. Build a CART model on the train data.
  4. Tune the model and prune the tree, if required.
  5. Test the model on the test set.

Data preparation

Split data

Build Decision Tree Model (raw)

We only have 10% positive classes, so a model that marks every sample as negative would still achieve 90% accuracy; hence accuracy is not a good metric to evaluate here.
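The point can be demonstrated with a trivial all-negative classifier on synthetic labels with a ~10% positive rate: accuracy looks healthy while recall, the share of actual loan takers caught, is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)  # ~10% positives
y_all_negative = np.zeros_like(y_true)          # never predicts a loan

acc = accuracy_score(y_true, y_all_negative)
rec = recall_score(y_true, y_all_negative, zero_division=0)
print(acc, rec)
```

This is why recall (or a recall-weighted score) is the metric tracked for the tree models below.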

Insights:

Visualizing the Decision Tree (raw)

Identify the key variables

Reducing overfitting - with Pre-Pruning

Using GridSearch for Hyperparameter tuning of our tree model
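A sketch of the grid search, again on synthetic data; the parameter grid and the choice of recall as the scoring metric are assumptions illustrating typical pre-pruning choices, not the exact grid used:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Hypothetical pre-pruning grid; the real values are tuned to the data
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10, 25],
    "criterion": ["gini", "entropy"],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",   # recall matters more than accuracy here
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
best_tree = grid.best_estimator_
```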

Visualizing the Decision Tree - after reducing overfitting (pre-pruning)

Identify the key variables

Total impurity of leaves vs effective alphas of pruned tree

To get an idea of what values of ccp_alpha could be appropriate, DecisionTreeClassifier.cost_complexity_pruning_path returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

We will train a decision tree using the effective alphas.
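The loop over effective alphas can be sketched as below, on synthetic data; note that at the largest effective alpha the tree collapses to the root:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Effective alphas and the total leaf impurity at each pruning step
tree = DecisionTreeClassifier(random_state=1)
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Fit one tree per effective alpha; larger alpha -> smaller tree
trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
    for a in ccp_alphas
]
leaf_counts = [t.get_n_leaves() for t in trees]
print(len(ccp_alphas), leaf_counts[0], leaf_counts[-1])
```

The chosen alpha (0.006 below) would then be the one giving the best test-set recall among these candidates.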

Creating model with 0.006 ccp_alpha

Visualizing the Decision Tree - with post-pruning (0.006 ccp_alpha)

Identify the key variables

Comparing all the decision tree models

Recommendations

End of File